Cut-GAR: solution to determine cut-off point in cloud storage system
SHAO Tian, CHEN Guangsheng, JING Weipeng
Journal of Computer Applications    2015, 35 (9): 2497-2502.   DOI: 10.11772/j.issn.1001-9081.2015.09.2497
To address the poor performance caused by the vague definition of small files in the Hadoop Distributed File System (HDFS), Cut-off Point via Grey Relational Analysis (Cut-GAR) was proposed to locate the cut-off point between small files and large files. First, the relationships between file size and three factors, namely the memory consumed by the NameNode (M), the speed in MB of Uploaded Files per Second (MUFS), and the speed in MB of Accessed Files per Second (MAFS), were analyzed, and the file size best suited to each factor was determined as FM, FMUFS and FMAFS respectively. Then, grey relational analysis was applied to weight the influence of the three factors on file size: file size was treated as the evaluated object, M, MUFS and MAFS served as the evaluation indexes, and the weight of each index and the relational degree between index and object were computed. The approximate optimal cut-off point was obtained as the sum of FM, FMUFS and FMAFS, each multiplied by its corresponding index weight. Experimental results demonstrate that Cut-GAR achieves a balance among M, MUFS and MAFS, which improves the performance of small-file processing.
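The weighting step described in the abstract can be illustrated with a minimal sketch (not the authors' code): the grey relational degree of each index series to the file-size series is computed with range normalization and the common distinguishing coefficient ρ = 0.5, the degrees are normalized into weights, and the cut-off point is the weighted sum of the candidate sizes FM, FMUFS and FMAFS. All function names and the specific normalization are illustrative assumptions.

```python
def grey_relational_degree(ref, cmp, rho=0.5):
    """Grey relational degree of a comparison series (an index such as M,
    MUFS or MAFS) to the reference series (file size), in (0, 1]."""
    def norm(s):
        # range-normalize a series to [0, 1]
        lo, hi = min(s), max(s)
        return [(x - lo) / (hi - lo) for x in s]
    r, c = norm(ref), norm(cmp)
    deltas = [abs(a - b) for a, b in zip(r, c)]
    dmin, dmax = min(deltas), max(deltas)
    # grey relational coefficient at each point, averaged into one degree
    coeffs = [(dmin + rho * dmax) / (d + rho * dmax) for d in deltas]
    return sum(coeffs) / len(coeffs)

def cutoff_point(candidates, degrees):
    """Weighted sum of the candidate cut-off sizes (FM, FMUFS, FMAFS);
    weights are the relational degrees normalized to sum to 1."""
    total = sum(degrees)
    weights = [d / total for d in degrees]
    return sum(w * f for w, f in zip(weights, candidates))
```

For example, with candidate sizes 4, 6 and 8 MB and relational degrees 1, 1 and 2, the weights are 0.25, 0.25 and 0.5 and the cut-off point is 6.5 MB.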
Implementation of decision tree algorithm dealing with massive noisy data based on Hadoop
LIU Yaqiu, LI Haitao, JING Weipeng
Journal of Computer Applications    2015, 35 (4): 1143-1147.   DOI: 10.11772/j.issn.1001-9081.2015.04.1143

Considering that current decision tree algorithms seldom account for the level of noise in the training set, and that traditional memory-resident algorithms have difficulty processing massive data, an Imprecise Probability C4.5 algorithm named IP-C4.5 was proposed based on Hadoop. During training, IP-C4.5 assumed that the training set used to build the decision tree is unreliable, and adopted the imprecise-probability information gain rate as the split criterion to reduce the influence of noisy data on the model. To enhance its ability to handle massive data, IP-C4.5 was implemented on Hadoop with MapReduce programming based on file splits. The experimental results show that when the training set is noisy, IP-C4.5 achieves higher accuracy than C4.5 and Complete CDT (CCDT), performing especially well when the noise level exceeds 10%; moreover, the parallelized Hadoop implementation is capable of processing massive data.
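The split criterion mentioned in the abstract can be sketched as follows. This is not the authors' implementation: it is a simplified illustration of an imprecise-probability information gain built on the Imprecise Dirichlet Model, where the extra mass s is shared among the least-frequent classes to obtain the maximum-entropy distribution in the credal set (a simplification of the usual procedure). Function names and the choice s = 1 are assumptions.

```python
import math

def max_entropy_probs(counts, s=1.0):
    """Approximate maximum-entropy distribution in the IDM credal set:
    the imprecision mass s is shared equally by the least-frequent classes."""
    m = min(counts)
    mins = [i for i, c in enumerate(counts) if c == m]
    adj = [c + (s / len(mins) if c == m else 0.0) for c in counts]
    total = sum(counts) + s
    return [a / total for a in adj]

def imprecise_entropy(counts, s=1.0):
    """Shannon entropy (bits) of the maximum-entropy credal distribution."""
    return -sum(p * math.log2(p) for p in max_entropy_probs(counts, s) if p > 0)

def imprecise_gain(parent_counts, child_counts, s=1.0):
    """Imprecise-probability information gain for one candidate split:
    parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent_counts)
    child_term = sum(sum(c) / n * imprecise_entropy(c, s) for c in child_counts)
    return imprecise_entropy(parent_counts, s) - child_term
```

Note that even a perfectly separating split keeps a positive child entropy under this criterion, because small leaves remain uncertain under the credal set; this is what penalizes splits driven by a few noisy examples.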
